Conversation

@singankit
Contributor

@singankit singankit commented Jul 24, 2025

Fixes #

Changes

This PR proposes a way to capture evaluation results for GenAI applications.

Prototype: https://github.com/singankit/evaluation_results

Note: if the PR is touching an area that is not listed in the existing areas, or the area does not have sufficient domain experts coverage, the PR might be tagged as experts needed and move slowly until experts are identified.

Merge requirement checklist

  • CONTRIBUTING.md guidelines followed.
  • Change log entry added, according to the guidelines in When to add a changelog entry.
    • If your PR does not need a change log, start the PR title with [chore]
  • Links to the prototypes or existing instrumentations (when adding or changing conventions)

Member

@lmolkova lmolkova left a comment


Are there publicly available prototypes of the code emitting evaluation results? Please link them in the PR description

@singankit singankit marked this pull request as ready for review August 5, 2025 20:16
@singankit singankit requested review from a team as code owners August 5, 2025 20:16
@singankit singankit changed the title from Gen AI Evaluation Event to Gen AI Evaluation Result Aug 5, 2025
@github-actions github-actions bot added the enhancement and area:gen-ai labels Aug 5, 2025
@dmontagu

dmontagu commented Aug 6, 2025

One issue I have with creating a separate span for tracking each evaluation score is that it makes it harder (at least with the way we index spans...) to write at least a couple of classes of queries for specific cases:

  • Find all cases where the total number of tokens for the task is above a threshold and a score is below a threshold (or vice versa)
    • This would require comparing information about token usage (or some other attribute, in the abstract) from the task execution span with the evaluation score from the evaluation score span
  • Find all cases where score A is above a threshold and score B is below a threshold
    • This would require comparing two different evaluation score spans

It would work way better for us if there was a way that we could, while complying with the semantic conventions, put all this information as attributes of a single span so that we can query it at once. I guess we can do that in addition to complying with the semantic convention, this just sticks out to me as an unfortunate aspect of this design.
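To make that concrete, here is a rough sketch of the single-span shape I have in mind, assuming the OpenTelemetry Python SDK; the per-score attribute keys are invented purely for illustration and are not part of this proposal:

# Hypothetical sketch: record evaluation scores as attributes on the task span
# itself, so token usage and scores can be filtered in a single span query.
# The "gen_ai.evaluation.score.<name>" keys below are made up for illustration.
from opentelemetry import trace

tracer = trace.get_tracer("single-span-demo")

with tracer.start_as_current_span("chat gpt-4o") as span:
    # ... execute the task and record the usual gen_ai.* attributes ...
    span.set_attribute("gen_ai.usage.input_tokens", 15)
    span.set_attribute("gen_ai.usage.output_tokens", 114)
    # Evaluation scores computed before the span ends, keyed by evaluation name:
    span.set_attribute("gen_ai.evaluation.score.relevance", 4)
    span.set_attribute("gen_ai.evaluation.score.groundedness", 5)

With that shape, both classes of queries above become single-span attribute filters, at the cost of requiring the scores to exist before the span ends.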

@singankit
Contributor Author


Thank you for your detailed feedback. I agree that having all relevant metrics as attributes on a single span would simplify querying and analysis; however, I'm not fully sure this approach is flexible enough to cover the various scenarios, especially asynchronous evaluations.

To clarify with a concrete scenario:

  • On Day 1, I select 3 evaluation metrics for my GenAI application.
  • On Day 5, I realize that 3 metrics are not sufficient and decide to add 2 more, bringing the total to 5 metrics.

If I need to retroactively compute the 2 new metrics on existing traces that already contain the original 3 metrics, would the recommended approach be to generate new spans for these additional metrics? Or is there a preferred way to update the original span with the new evaluation results, while still adhering to the semantic conventions?

A similar situation could arise if some evaluation metrics are computed asynchronously or by a downstream service at different times. In such cases, would each metric (or set of metrics computed together) necessarily require a separate span, or is there flexibility to consolidate them as attributes on the original span?

Thanks again for your insights. I’m keen to understand the best practices for handling evolving needs of asynchronous evaluation workflows in line with the semantic conventions.

  • Would evaluation scores as events help in this case? If so, I can start a new issue to discuss whether evaluations should be emitted as events in addition to spans.
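If "events" here is read as span events, a minimal sketch (assuming the OpenTelemetry Python SDK and reusing the attribute names proposed in this PR; the event name itself is hypothetical) could look like the following. Note that span events can only be added while the span is still recording, which is exactly where the asynchronous case gets awkward:

from opentelemetry import trace

tracer = trace.get_tracer("evaluation-events-demo")

with tracer.start_as_current_span("chat gpt-4o") as span:
    # ... execute the task ...
    # Attach an evaluation result as a span event. This must happen before the
    # span ends, so it does not cover evaluations computed later or downstream.
    span.add_event(
        "gen_ai.evaluation",  # hypothetical event name
        attributes={
            "gen_ai.evaluation.name": "relevance",
            "gen_ai.evaluation.score": 4,
            "gen_ai.evaluation.label": "Pass",
        },
    )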

@dmontagu
Copy link

dmontagu commented Aug 6, 2025

If I need to retroactively compute the 2 new metrics on existing traces that already contain the original 3 metrics, would the recommended approach be to generate new spans for these additional metrics? Or is there a preferred way to update the original span with the new evaluation results, while still adhering to the semantic conventions?

For the most part, I typically think of traces as static/immutable after their root span closes (I know that's not technically the case, and there are reasonable scenarios where that is explicitly avoided, but still). I think we should just stick to that mental model here — if you want to "mutate" an evaluation run/experiment/whatever-you-want-to-call-it, my personal feeling is that that should happen in some application-layer logic, and just rely on OTel for a static record of what happened when it happened. For example, you could create a new evaluation run where you just copy the old run's outputs for the metrics you've already computed, and compute new ones where you want.

I would personally have no problem just generating a new trace for the updated evaluation results (even if that meant copying execution data from an older trace or otherwise referencing it via span links or something), I don't need to extend an old trace. Maybe others feel differently, but just sharing my opinion.

I'll note that the idea of adding additional metrics also feels somewhat awkward because, while I can always add new metrics to an old trace, what about redefining an existing metric? Presumably that opens more of a can of worms about what it means to "overwrite" old spans? I personally feel it's better to avoid the whole issue by not encouraging this pattern. Just my 2c

@singankit
Contributor Author


The proposal in this PR is consistent with your feedback on treating traces as static/immutable. It keeps the span being evaluated untouched and instead adds a link on the evaluation span that identifies the span being evaluated, as the example below shows. This leaves the flexibility to add more evaluations later as needed.

Span Being Evaluated:

{
    "name": "chat gpt-4o",
    "context": {
        "trace_id": "0xeb0fdf5670975fea194b2eef13e789c6",
        "span_id": "0x63e929946253cf52",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2025-08-04T23:45:55.959766Z",
    "end_time": "2025-08-04T23:45:57.512342Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.system": "openai",
        "gen_ai.request.model": "gpt-4o",
        "server.address": "anksing1rpeastus2.openai.azure.com",
        "_MS.sampleRate": 100.0,
        "gen_ai.response.model": "gpt-4o-2024-11-20",
        "gen_ai.response.finish_reasons": [
            "stop"
        ],
        "gen_ai.response.id": "chatcmpl-C0zAK9AqI5MpwVScha6p1Dm5BgO7R",
        "gen_ai.usage.input_tokens": 15,
        "gen_ai.usage.output_tokens": 114
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.36.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}

Evaluation Span:

{
    "name": "evaluation relevance",
    "context": {
        "trace_id": "0x6ebb9835f43af1552f2cebb9f5165e39",
        "span_id": "0x89829115c2128845",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2025-08-04T23:48:35.592833Z",
    "end_time": "2025-08-04T23:48:35.592833Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "_MS.sampleRate": 100.0,
        "gen_ai.operation.name": "evaluation",
        "gen_ai.evaluation.name": "relevance",
        "gen_ai.evaluation.score": 4,
        "gen_ai.evaluation.label": "Pass",
        "gen_ai.evaluation.reasoning": "Response is relevant to the query."
    },
    "events": [],
    "links": [ // Added Links
        {
            "context": {
                "trace_id": "0xeb0fdf5670975fea194b2eef13e789c6",
                "span_id": "0x63e929946253cf52",
                "trace_state": "[]"
            },
            "attributes": {
                "gen_ai.operation.name": "evaluation"
            }
        }
    ],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.36.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}
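
For completeness, a minimal sketch of emitting such a linked evaluation span with the OpenTelemetry Python SDK (the helper name and the way the evaluated span's context is rebuilt from stored ids are assumptions for illustration):

from opentelemetry import trace
from opentelemetry.trace import Link, SpanContext, SpanKind, TraceFlags

tracer = trace.get_tracer("evaluation-demo")

def emit_evaluation_span(evaluated: SpanContext, name: str, score: float,
                         label: str, reasoning: str) -> None:
    # Hypothetical helper: records one evaluation result as its own span and
    # links it back to the span being evaluated, per the proposal above.
    link = Link(evaluated, attributes={"gen_ai.operation.name": "evaluation"})
    with tracer.start_as_current_span(
        f"evaluation {name}", kind=SpanKind.INTERNAL, links=[link]
    ) as span:
        span.set_attribute("gen_ai.operation.name", "evaluation")
        span.set_attribute("gen_ai.evaluation.name", name)
        span.set_attribute("gen_ai.evaluation.score", score)
        span.set_attribute("gen_ai.evaluation.label", label)
        span.set_attribute("gen_ai.evaluation.reasoning", reasoning)

# The evaluated span's context can be captured at request time or rebuilt later
# from stored trace/span ids (values taken from the example above):
evaluated_ctx = SpanContext(
    trace_id=0xEB0FDF5670975FEA194B2EEF13E789C6,
    span_id=0x63E929946253CF52,
    is_remote=True,
    trace_flags=TraceFlags(TraceFlags.SAMPLED),
)
emit_evaluation_span(evaluated_ctx, "relevance", 4, "Pass",
                     "Response is relevant to the query.")

Because the evaluation span lives in its own trace and only references the original span via a link, new evaluations (or re-runs of an updated metric) can be added at any later time without touching the original trace.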
      

what about redefining an existing metric?

  • Would this be a good candidate for metric versioning? For version 2, there could be a new evaluation span that links to the span being evaluated.

I appreciate you bringing up these scenarios and use cases; it will help shape this work. :)

@zhirafovod

Looks good to me overall

@github-project-automation github-project-automation bot moved this from Untriaged to Needs More Approval in Semantic Conventions Triage Aug 26, 2025
@singankit
Contributor Author

Thank you all for the valuable feedback and thoughtful discussion that helped bring this PR to a merge-ready state.
@lmolkova, could you please proceed with merging the PR at your convenience? It has received the required approvals.

@lmolkova lmolkova added this pull request to the merge queue Aug 26, 2025
Merged via the queue into open-telemetry:main with commit ebbf315 Aug 26, 2025
15 checks passed
